Add DTLS throughput benchmark tool and optimize send path by julek-wolfssl · Pull Request #10551 · wolfSSL/wolfssl

julek-wolfssl · 2026-05-28T14:07:42Z

Add examples/benchmark/dtls_bench, a DTLS throughput benchmark that completes a handshake and then measures bulk-send throughput. It supports DTLS 1.2 and 1.3, selectable cipher suites, an end-to-end mode, and a -z sink mode that discards records on the server after the handshake to isolate the sender's record-layer cost. The socket is set up with wolfSSL_set_dtls_fd_connected.

Optimize the send path exercised by the benchmark:

wolfio (EmbedSendTo): cache the per-descriptor socket-type probe (getsockopt SO_TYPE) in WOLFSSL_DTLS_CTX instead of running it on every send, removing a syscall from the record send path. The cache is invalidated whenever rfd/wfd is reassigned.
internal (BuildMessage): for AEAD suites whose explicit nonce is the 8-byte record sequence number, write the sequence number directly as nonce_explicit instead of drawing it from the RNG. This covers AES-GCM (RFC 5288 sec 3), AES-CCM (RFC 6655 sec 3), SM4-GCM/CCM (RFC 8998 sec 3), and Camellia-/ARIA-GCM which inherit the RFC 5288 construction; ChaCha20 uses an implicit nonce and is excluded. A new read-only PeekSEQ() helper reads the sequence number without advancing the per-direction counter, leaving the single mandated increment to writeAeadAuthData().

Also ignore the built dtls_bench binary in .gitignore.

Copilot

Pull request overview

This PR adds a new DTLS throughput benchmark under examples/benchmark/ and makes two optimizations in the DTLS send path to better measure (and reduce) per-record overhead in wolfSSL’s record layer and socket I/O glue.

Changes:

Add examples/benchmark/dtls_bench.c: a DTLS 1.2/1.3 throughput benchmark with cipher selection, plain-UDP baseline mode, and a client-side “sink send” mode.
Optimize DTLS send path by caching the SO_TYPE (datagram vs stream) probe in WOLFSSL_DTLS_CTX instead of calling getsockopt() on every send.
Optimize AEAD explicit-nonce construction by writing the record sequence number directly for suites where the explicit nonce is defined as the seq number, using a new read-only PeekSEQ() helper.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`wolfssl/internal.h`	Adds DTLS context fields for caching socket type probe results.
`tests/api.c`	Resets new DTLS context cache fields when copying SSL state in an API test helper.
`src/wolfio.c`	Changes datagram-vs-stream detection to cache `SO_TYPE` results.
`src/ssl.c`	Invalidates the DTLS socket-type cache when read/write fds are (re)assigned.
`src/internal.c`	Adds `PeekSEQ()` and uses it to derive AEAD explicit nonce from sequence number for applicable suites.
`examples/benchmark/include.am`	Adds `dtls_bench` to Automake build outputs.
`examples/benchmark/dtls_bench.c`	New DTLS benchmark tool implementation.
`.gitignore`	Ignores the built `examples/benchmark/dtls_bench` binary.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

github-actions · 2026-05-28T17:20:42Z

retest this please

github-actions · 2026-05-29T18:49:59Z

MemBrowse Memory Report

gcc-arm-cortex-m3

FLASH: .text +104 B (+0.1%, 121,009 B / 262,144 B, total: 46% used)

gcc-arm-cortex-m4

FLASH: .text +128 B (+0.1%, 198,318 B / 262,144 B, total: 76% used)

gcc-arm-cortex-m4-dtls13

FLASH: .text +64 B (+0.0%, 179,352 B / 1,048,576 B, total: 17% used)

gcc-arm-cortex-m4-openssl-compat

FLASH: .text +64 B (+0.0%, 765,740 B / 1,048,576 B, total: 73% used)

gcc-arm-cortex-m4-rsa-only

FLASH: .text +64 B (+0.0%, 322,168 B / 1,048,576 B, total: 31% used)

gcc-arm-cortex-m4-tls12

FLASH: .text +64 B (+0.1%, 121,741 B / 262,144 B, total: 46% used)

gcc-arm-cortex-m7

FLASH: .text +128 B (+0.1%, 198,318 B / 262,144 B, total: 76% used)
No memory changes detected for:
gcc-arm-cortex-m0plus
gcc-arm-cortex-m4-baremetal
gcc-arm-cortex-m4-crypto-only
gcc-arm-cortex-m4-min-ecc
gcc-arm-cortex-m4-pkcs7
gcc-arm-cortex-m4-pq
gcc-arm-cortex-m4-sp-math
gcc-arm-cortex-m4-tls13
gcc-arm-cortex-m7-pq
gcc-arm-cortex-m7-tls13
linuxkm-pie
linuxkm-standard
stm32-sim-stm32h753

julek-wolfssl · 2026-05-31T15:39:04Z

retest this please.

Add examples/benchmark/dtls_bench, a DTLS throughput benchmark that completes a handshake and then measures bulk-send throughput. It supports DTLS 1.2 and 1.3, selectable cipher suites, an end-to-end mode, and a -z sink mode that discards records on the server after the handshake to isolate the sender's record-layer cost. The socket is set up with wolfSSL_set_dtls_fd_connected. Optimize the send path exercised by the benchmark: - wolfio (EmbedSendTo): cache the per-descriptor socket-type probe (getsockopt SO_TYPE) in WOLFSSL_DTLS_CTX instead of running it on every send, removing a syscall from the record send path. The cache is invalidated whenever rfd/wfd is reassigned. - internal (BuildMessage): for AEAD suites whose explicit nonce is the 8-byte record sequence number, write the sequence number directly as nonce_explicit instead of drawing it from the RNG. This covers AES-GCM (RFC 5288 sec 3), AES-CCM (RFC 6655 sec 3), SM4-GCM/CCM (RFC 8998 sec 3), and Camellia-/ARIA-GCM which inherit the RFC 5288 construction; ChaCha20 uses an implicit nonce and is excluded. A new read-only PeekSEQ() helper reads the sequence number without advancing the per-direction counter, leaving the single mandated increment to writeAeadAuthData(). Also ignore the built dtls_bench binary in .gitignore.

julek-wolfssl · 2026-06-01T16:26:22Z

The jenkins failures are not related to this PR.

dtls_bench.c is built whenever DTLS and the example servers are enabled, including the cross-mingw-all-crypto multi-test scenario, which cross- compiles for Windows. It directly includes POSIX-only headers (<sys/socket.h>, <arpa/inet.h>, <netdb.h>, <net/if.h>) that mingw does not ship, so the build failed there. Gate the networking includes and the whole benchmark body behind a DTLS_BENCH_ENABLED check (WOLFSSL_DTLS, not USE_WINDOWS_API, not WOLFSSL_NO_SOCK). When the platform lacks POSIX BSD sockets, compile a small stub main() that reports the tool is unsupported, so the source tree still builds.

Fixed

dgarske

Skoll Code Review

Scan type: reviewOverall recommendation: COMMENT
Findings: 5 total — 5 posted, 0 skipped
5 finding(s) posted as inline comments (see file-level comments below)

Posted findings

[Medium] isDGram cache write is an unsynchronized data race under WOLFSSL_RW_THREADED — src/wolfio.c:655-666
[Low] Single isDGram cache shared between rfd and wfd of potentially different socket types — src/wolfio.c:649-667
[Low] Confirm discarded explicit-nonce value is intentional across FIPS and epoch-order paths — src/internal.c:24809-24831
[Low] now_sec ignores clock_gettime failure, can return uninitialized time — examples/benchmark/dtls_bench.c:96-101
[Low] -z (sink send) silently ignored when combined with -s (server) — examples/benchmark/dtls_bench.c:parse_args

Review generated by Skoll

dgarske · 2026-06-08T22:54:12Z

    /* optvalue 'type' is of size int */
    XSOCKLENT length = (XSOCKLENT)sizeof(type);

+    if (dtlsCtx->isDGramCached)


🟠 [Medium] isDGram cache write is an unsynchronized data race under WOLFSSL_RW_THREADED

Previously isDGramSock() was a pure read of the socket type. It now lazily writes dtlsCtx->isDGram and dtlsCtx->isDGramCached (bit-fields that share a storage unit with connected, and with processingPendingRecord when WOLFSSL_DTLS_CID is set). In a WOLFSSL_RW_THREADED build, EmbedSendTo (write thread) calls this on every send and EmbedReceiveFrom (read thread) calls it on the recvd == 0 path, so two threads can perform the first-time write concurrently. Writing adjacent bit-fields is a read-modify-write of the shared byte and is not atomic, so this is a data race (UB per the C memory model) on dtlsCtx. In practice it is benign because both threads write the same socket-type value and connected is not modified concurrently, but it can be reported by ThreadSanitizer CI and is a new race introduced by moving shared-state mutation into the hot I/O path. The rest of dtlsCtx peer state is already protected by peerLock precisely because these callbacks run concurrently.

Fix: Confirm intent for threaded builds; prefer computing the cache eagerly at fd assignment to remove the race from the data path entirely.

dgarske · 2026-06-08T22:54:13Z

+ * fixed for its lifetime, so the getsockopt(SO_TYPE) probe runs once per
+ * connection and is cached in dtlsCtx; the cache is invalidated wherever
+ * rfd/wfd is reassigned. */
+static int isDGramSock(WOLFSSL_DTLS_CTX* dtlsCtx, int sfd)


🔵 [Low] Single isDGram cache shared between rfd and wfd of potentially different socket types

The cache stores one isDGram flag per dtlsCtx, but it is consulted for both the read descriptor (dtlsCtx->rfd, via EmbedReceiveFrom) and the write descriptor (dtlsCtx->wfd, via EmbedSendTo). If an application sets rfd and wfd to different sockets of different types (legal via wolfSSL_set_read_fd/wolfSSL_set_write_fd), the first probe wins and the cached value is then incorrectly applied to the other descriptor. This is unusual for DTLS (rfd == wfd normally) and the cache is invalidated on each fd reassignment, so the common path is fine, but the original code re-probed the exact sfd passed in each time, whereas the cache is now keyed on the context rather than the descriptor.

Fix: Confirm the rfd==wfd (or same-type) assumption holds for all supported DTLS usages; document it next to the cache fields.

dgarske · 2026-06-08T22:54:13Z

-                ret = wc_RNG_GenerateBlock(ssl->rng, args->iv, args->ivSz);
-                if (ret != 0)
-                    goto exit_buildmsg;
+#ifdef HAVE_AEAD


🔵 [Low] Confirm discarded explicit-nonce value is intentional across FIPS and epoch-order paths

The new code writes the record sequence number into the explicit nonce via PeekSEQ(ssl, epochOrder, args->iv) instead of wc_RNG_GenerateBlock. In every standard configuration this written value is subsequently overwritten before transmission: the non-FIPS/FIPSv2+ path copies the cipher-generated IV over it (XMEMCPY(out, ssl->encrypt.nonce + AESGCM_IMP_IV_SZ, ...) at line ~20741), and the legacy-FIPS path overwrites args->iv with ssl->keys.aead_exp_IV (line ~24839). So the real effect of this change is removing the per-record RNG draw; the seq value itself is intentionally discarded. Note also that the actual AAD seq is written with CUR_ORDER inside writeAeadAuthData (line ~20695) while PeekSEQ here uses epochOrder, so the two would differ for DTLS SCR PREV_ORDER retransmits, but this is harmless precisely because the PeekSEQ-written bytes are discarded. This is correct but subtle; a future reader could mistakenly assume the on-wire explicit nonce equals PeekSEQ.

Fix: Keep the change; add a short clarifying comment to prevent future misreading and confirm the RNG-removal is the intended win.

dgarske · 2026-06-08T22:54:13Z

+#define DEFAULT_SOCKBUF    (8 * 1024 * 1024)
+
+/* Upper bound on per-record overhead the user must leave under MTU.
+ * DTLS 1.2 worst case: 13 (hdr) + 8 (explicit nonce) + 16 (AEAD tag) = 37.


🔵 [Low] now_sec ignores clock_gettime failure, can return uninitialized time

clock_gettime()'s return value is cast to void and ts is not initialized, so on the (rare) failure of CLOCK_MONOTONIC the function returns garbage from the stack, corrupting all throughput/duration math. This is benchmark-only code so impact is limited, but it is trivial to harden.

Fix: Initialize ts and check the return value, returning a sentinel on failure.

dgarske · 2026-06-08T22:54:13Z

+                break;
+            case 'b': c->sockBuf     = atoi(optarg); break;
+            case 'n': c->plainUdp    = 1; break;
+            case 'z': c->sinkSend    = 1; break;


🔵 [Low] -z (sink send) silently ignored when combined with -s (server)

The help text documents -z as client-only, but parse_args performs no validation when -z is passed together with -s. In server mode sinkSend is parsed and then never used, so a user invoking -s -z gets no warning that the flag had no effect. Minor usability nit in example code.

Fix: Emit a warning (or error) when -z is combined with -s so the no-op is visible to the user.

Copilot AI review requested due to automatic review settings May 28, 2026 14:07

Copilot started reviewing on behalf of julek-wolfssl May 28, 2026 14:07 View session

Copilot AI reviewed May 28, 2026

View reviewed changes

Comment thread src/wolfio.c

Comment thread examples/benchmark/include.am

Comment thread examples/benchmark/dtls_bench.c

Comment thread examples/benchmark/dtls_bench.c

julek-wolfssl self-assigned this May 28, 2026

julek-wolfssl force-pushed the dtls-perf-benchmark branch from 8d445f1 to 32c7f0b Compare May 28, 2026 17:02

julek-wolfssl marked this pull request as ready for review May 28, 2026 17:19

julek-wolfssl force-pushed the dtls-perf-benchmark branch 2 times, most recently from 7b5387d to 4068636 Compare May 29, 2026 16:27

julek-wolfssl force-pushed the dtls-perf-benchmark branch from 4068636 to 0d93481 Compare May 31, 2026 14:02

julek-wolfssl force-pushed the dtls-perf-benchmark branch from 0d93481 to 0c90533 Compare June 1, 2026 11:12

julek-wolfssl force-pushed the dtls-perf-benchmark branch from 0c90533 to 1e7d632 Compare June 1, 2026 13:32

julek-wolfssl assigned wolfSSL-Bot and unassigned julek-wolfssl Jun 1, 2026

douzzer added the Staged Staged for merge pending final test results and review label Jun 5, 2026

douzzer previously requested changes Jun 5, 2026

View reviewed changes

Comment thread examples/benchmark/dtls_bench.c

douzzer removed the Staged Staged for merge pending final test results and review label Jun 5, 2026

julek-wolfssl force-pushed the dtls-perf-benchmark branch from 1e7d632 to cae9827 Compare June 5, 2026 22:19

julek-wolfssl requested a review from douzzer June 5, 2026 22:20

dgarske reviewed Jun 8, 2026

View reviewed changes

dgarske assigned julek-wolfssl Jun 8, 2026

Conversation

julek-wolfssl commented May 28, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 28, 2026

Uh oh!

github-actions Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

MemBrowse Memory Report

Uh oh!

julek-wolfssl commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

julek-wolfssl commented Jun 1, 2026

Uh oh!

Uh oh!

dgarske left a comment

Choose a reason for hiding this comment

Skoll Code Review

Posted findings

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

github-actions Bot commented May 29, 2026 •

edited

Loading

julek-wolfssl commented May 31, 2026 •

edited

Loading